Abstract

Using basic properties of p-adic numbers, we consider a simple
new approach to describing the main aspects of the DNA sequence and the
genetic code. The central role in our investigation is played by an
ultrametric p-adic information space whose basic elements are nucleotides,
codons and genes. We show that a 5-adic model is appropriate for the DNA
sequence. This 5-adic model, combined with the 2-adic distance, is also
suitable for the genetic code and for more advanced use in genomics. We
find that genetic code degeneracy is related to the p-adic distance
between codons.



1 Introduction 

It is well known that practically all genetic information in living systems is
contained in the deoxyribonucleic acid (DNA) sequence. DNA macromolecules
are made of two polynucleotide chains with a double-helical structure.
There are four nucleotides: adenine (A), guanine (G), cytosine
(C) and thymine (T). A and G are purines, while C and T are pyrimidines.
DNA is packaged into chromosomes, which are localized in the nucleus of
eukaryotic cells. One of the basic processes within DNA is its replication.
The passage of gene information to protein, called gene expression, is
performed by messenger ribonucleic acid (mRNA), which is usually a single
polynucleotide chain. In the first part of this process, known as transcription,
the nucleotides A, G, C, T from DNA are respectively transcribed into the
nucleotides U, C, G, A of mRNA, i.e. T is replaced by U, where U is uracil.
The next step is translation, in which mRNA codon information is translated
into the synthesis of proteins. Codons are ordered sequences of three nucleotides
taken from A, G, C, U. Protein synthesis in all eukaryotic cells takes place in the
cytoplasm. Through their codons, genes control the amino-acid sequences in
proteins. There are 4 x 4 x 4 = 64 possible codons. Of these, 61
specify the 20 different amino-acids and 3 are stop codons,
which serve as termination signals. As a result, most amino-acids are encoded
by more than one codon. This degenerate correspondence between codons
and amino-acids is known as the genetic code, which is mostly universal for all
living organisms. In almost all cells genetic information flows from DNA to
RNA to protein. For detailed and comprehensive information on molecular
biology aspects of DNA, RNA and the genetic code, see Ref. [1].

Processes within macromolecules can be regarded as quantum or classical,
depending on the scale we are interested in. Modeling of DNA, RNA and
the genetic code is a challenge as well as a chance for modern mathematical
physics. An interesting model based on the quantum algebra $U_q(sl(2) \oplus sl(2))$
in the $q \to 0$ limit was proposed as a symmetry algebra for the genetic code
(see and references therein). In a sense this approach mimics the quark
model of baryons. To describe the correspondence between codons and
amino-acids, an operator was constructed which acts on the space of codons and
whose eigenvalues are related to amino-acids. Despite some successes of this
approach, there is a problem with the rather large number of parameters in the operator.

There are some very complex systems (e.g. spin glasses and some
macromolecules) whose space of states has an ultrametric structure. The space of
conformational states of proteins is one such space. Processes on ultrametric spaces
usually need new methods for their description. p-Adic models with
pseudodifferential operators have been successfully applied to interbasin kinetics
of proteins [4], [5], [6] (for a brief review see [7]). Ultrametricity is a suitable
mathematical concept and a tool for the description of systems with hierarchical
structure. The first field of science where ultrametricity was observed was
taxonomy. The first review of ultrametricity in physics and biology was presented
twenty years ago [8]. A very significant and promising part of ultrametrics
is p-adics.

p-Adic numbers were discovered at the end of the 19th century by the German
mathematician Kurt Hensel. They have been successfully employed in many
parts of mathematics. Since 1987 they have also been used in the construction
of various physical models, especially in string theory, quantum mechanics,
quantum cosmology and dynamical systems (for a review, see [9] and [10]).
Some p-adic aspects of cognitive, psychological and social phenomena have
also been considered [11]. The present status of the application of p-adic numbers
in physics and related branches of science is reflected in the proceedings of
the 2nd International Conference on p-Adic Mathematical Physics [12].

A p-adic approach to genetics has not been attempted so far. The main aim
of this paper is to make the first step towards p-adic genomics. Starting with
a formulation of a p-adic genetic information space, we propose a 5-adic model
for DNA (and RNA) sequences and the genetic code. A central mathematical
tool for analyzing the classification of codons and the structure of the genetic
code is the p-adic distance between codons.

2 p-Adic numbers

Recall that numerical results of measurements in experiments and
observations are rational numbers. The set of all rational numbers $\mathbb{Q}$, with
the usual operations of addition and multiplication, is algebraically a field. In
addition to arithmetic operations, it is often important to know a distance
between numbers. Distance can be defined by a norm. On $\mathbb{Q}$ there are two
kinds of nontrivial norm: the usual absolute value $|\cdot|_\infty$ and the p-adic
absolute value $|\cdot|_p$, where p is any prime number. The usual absolute value is
well known from elementary courses of mathematics, and the corresponding distance
between two real numbers x and y is $d_\infty(x, y) = |x - y|_\infty$. With respect to
this distance, all infinite decimal expansions of real numbers

$$x = \pm 10^n \sum_{k=0}^{-\infty} a_k 10^k , \quad a_k \in \{0, 1, \cdots, 9\} , \quad a_0 \neq 0 , \quad n \in \mathbb{Z} \qquad (1)$$

are convergent.

By definition, the p-adic norm of a rational number $x = p^\nu \frac{r}{s}$, where
$\nu \in \mathbb{Z}$ and the integers r and s are not divisible by the given prime number p, is
$|x|_p = p^{-\nu}$, and $|0|_p = 0$. This norm is a mapping from $\mathbb{Q}$ into the non-negative
real numbers and has the following properties:

(i) $|x|_p \geq 0$, $|x|_p = 0$ if and only if $x = 0$,

(ii) $|xy|_p = |x|_p \, |y|_p$,

(iii) $|x + y|_p \leq \max\{|x|_p , |y|_p\} \leq |x|_p + |y|_p$ for all $x , y \in \mathbb{Q}$.

Because of the strong triangle inequality $|x + y|_p \leq \max\{|x|_p , |y|_p\}$, the p-adic
absolute value is a non-Archimedean (or ultrametric) norm.
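
The definition above translates directly into a few lines of code. The following is a minimal Python sketch, not part of the original formalism: the function names `p_adic_valuation` and `p_adic_norm` are ours, introduced only for illustration. It computes $|x|_p$ exactly for a rational x and spot-checks properties (ii) and (iii) on two sample values.

```python
from fractions import Fraction


def p_adic_valuation(x, p):
    """Exponent v in x = p**v * r/s, with r and s not divisible by the prime p."""
    x = Fraction(x)
    if x == 0:
        raise ValueError("the valuation of 0 is not a finite integer")
    num, den = x.numerator, x.denominator
    v = 0
    while num % p == 0:   # powers of p in the numerator raise v
        num //= p
        v += 1
    while den % p == 0:   # powers of p in the denominator lower v
        den //= p
        v -= 1
    return v


def p_adic_norm(x, p):
    """|x|_p = p**(-v) as an exact Fraction, with |0|_p = 0."""
    x = Fraction(x)
    if x == 0:
        return Fraction(0)
    return Fraction(p) ** (-p_adic_valuation(x, p))


# Sample values chosen only for illustration.
x, y = Fraction(75, 2), Fraction(10, 3)
print(p_adic_norm(x, 5), p_adic_norm(y, 5))
# property (ii): multiplicativity
print(p_adic_norm(x * y, 5) == p_adic_norm(x, 5) * p_adic_norm(y, 5))
# property (iii): strong triangle inequality
print(p_adic_norm(x + y, 5) <= max(p_adic_norm(x, 5), p_adic_norm(y, 5)))
```

Exact `Fraction` arithmetic is used so that comparisons between norms are not affected by floating-point rounding.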

The p-adic distance between two rational numbers x and y is

$$d_p(x, y) = |x - y|_p . \qquad (2)$$

Since the p-adic absolute value is ultrametric, the p-adic distance (2) is also
ultrametric, i.e. it satisfies

$$d_p(x, y) \leq \max\{d_p(x, z) , d_p(z, y)\} \leq d_p(x, z) + d_p(z, y) , \qquad (3)$$

where x, y and z are any three points of a p-adic space.
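
The difference between the ultrametric inequality (3) and the ordinary triangle inequality can be checked by brute force on a small set of integers. The sketch below is illustrative only; the helper names `d_p` and `violations` and the range of test points are our own choices.

```python
from fractions import Fraction
from itertools import product


def d_p(x, y, p):
    """p-adic distance |x - y|_p between two integers, as an exact Fraction."""
    diff = abs(x - y)
    if diff == 0:
        return Fraction(0)
    k = 0
    while diff % p == 0:
        diff //= p
        k += 1
    return Fraction(1, p**k)


def violations(points, dist):
    """Triples (x, y, z) for which dist(x, y) > max(dist(x, z), dist(z, y))."""
    return [(x, y, z) for x, y, z in product(points, repeat=3)
            if dist(x, y) > max(dist(x, z), dist(z, y))]


points = range(30)
five_adic = violations(points, lambda x, y: d_p(x, y, 5))
real = violations(points, lambda x, y: abs(x - y))
print(len(five_adic))   # no triple violates the strong inequality for d_5
print(len(real) > 0)    # the usual distance does violate it, e.g. for (0, 2, 1)
```

For the usual distance the triple x = 0, y = 2, z = 1 already fails the strong inequality, since 2 > max(1, 1), while it satisfies the ordinary triangle inequality.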

In direct analogy with the field $\mathbb{R}$ of real numbers, the field $\mathbb{Q}_p$ of p-adic
numbers can be introduced by completion of $\mathbb{Q}$ with respect to the distance
(2). Note that for each prime p there is one $\mathbb{Q}_p$. Any nonzero $x \in \mathbb{Q}_p$ has a unique
expansion

$$x = p^m \sum_{k=0}^{+\infty} a_k p^k , \quad a_k \in \{0, 1, \cdots, p-1\} , \quad a_0 \neq 0 , \qquad (4)$$

where m is an ordinary integer.

In this paper we use only p-adic integers, for which $m = 0, 1, 2, \cdots$.
For a simple introduction to p-adic numbers, see the book [13].

3 p-Adic Genetic Information Space

We now present a mathematical formalism suitable for modeling the
genetic code and the DNA sequence. Let us first introduce an information space
$\mathcal{I}$ as a subset of the set $\mathbb{Z}$ of usual integers, where to each $m \in \mathcal{I}$
some information is attached. Different numbers $a, b \in \mathcal{I}$ contain different
information. Let the standard arithmetic operations (addition,
subtraction and multiplication) be valid on elements of $\mathcal{I}$.

Since one piece of information can be more or less similar (or dissimilar) to another,
it makes sense to introduce a mathematical tool to measure similarity (or
dissimilarity). Such a tool is a distance between the corresponding integers.
But now a question arises: what kind of distance should we take between
integers to describe closeness in the information space? Recall that there
are two kinds of distance for integers: the usual real (Archimedean) distance and
the p-adic (non-Archimedean, ultrametric) distance. We propose, for a class of $\mathcal{I}$, to
employ the p-adic distance (defined in the preceding section), i.e. $d_p(a, b) =
|a - b|_p$, $a, b \in \mathbb{Z}$. As a consequence one has a quite natural property: two
pieces of information are closer, i.e. at a smaller distance, if more of the
first digits in their p-adic expansions coincide. One also has that digits which come
later in the expansion are less important (for a similar treatment of
information see [11]). In the sequel an information space with p-adic distance
will be called a p-adic information space. Some experimental properties of
the genetic code lead us to introduce the p-adic genetic space $\mathcal{G}_p$ as a special case of
the p-adic $\mathcal{I}$. An element $m \in \mathcal{G}_p$ can be presented in the form

$$m = \pm p^N \sum_{i=0}^{n} m_i p^i , \quad m_i \in \{0, 1, \cdots, p-1\} , \qquad (5)$$

where N, n are nonnegative integers and $m_i$ are digits. For given p and N,
the information m is characterized by the sequence of digits $m_0, m_1, \cdots, m_n$. In
other words, information is coded by the ordered sequence of digits $m_0, m_1, \cdots, m_n$.
If integers $a, b \in \mathcal{G}_p$ have expansions

$$a = a_0 + a_1 p + a_2 p^2 + \cdots , \quad b = b_0 + b_1 p + b_2 p^2 + \cdots , \qquad (6)$$

then $d_p(a, b) = p^{-k}$ if $a_0 = b_0, \cdots, a_{k-1} = b_{k-1}$ and $a_k \neq b_k$. Accordingly,
$d_p(a, b) = p^{-k}$ is smaller as k is larger, and a, b are then closer (i.e. more
similar). This p-adic closeness will be exploited later in the analysis of genetic code
degeneracy, but now let us turn to the p-adic modeling of DNA.
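
The rule "distance $p^{-k}$, where k is the position of the first differing digit" can be sketched as follows. This is an illustrative Python fragment under our own assumptions: the names `digits` and `distance_from_digits` are hypothetical, and the integers are taken to be nonnegative and smaller than `p**length`.

```python
from fractions import Fraction


def digits(m, p, length):
    """First `length` digits m_0, m_1, ... of a nonnegative integer m in base p."""
    out = []
    for _ in range(length):
        m, r = divmod(m, p)
        out.append(r)
    return out


def distance_from_digits(a, b, p, length=32):
    """p**(-k), where k is the index of the first digit in which a and b differ."""
    for k, (x, y) in enumerate(zip(digits(a, p, length), digits(b, p, length))):
        if x != y:
            return Fraction(1, p**k)
    return Fraction(0)  # all compared digits coincide


# Base-5 digits: 7 -> (2, 1, 0), 32 -> (2, 1, 1), 8 -> (3, 1, 0)
a, b, c = 7, 32, 8
print(digits(a, 5, 3), digits(b, 5, 3), digits(c, 5, 3))
print(distance_from_digits(a, b, 5))   # the first two digits agree
print(distance_from_digits(a, c, 5))   # the first digits already differ
```

Here 7 and 32 share their first two digits and are therefore 5-adically closer to each other than 7 and 8, although 7 and 8 are neighbours in the usual distance.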

4 p-Adic model of the DNA sequence

To have an appropriate p-adic genetic space $\mathcal{G}_p$ that can describe the DNA
sequence and the genetic code, one has to choose the prime number
p which will be used as the base of the expansion. For the base of the expansion of
genetic information we choose p = 5, because 5 is the smallest prime number
that can represent the four nucleotides (A, T, G, C) in DNA, or (A, U, G, C)
in RNA, as four different digits. At first glance, because
there are four nucleotides, one might think that a 4-adic expansion,
which has just four digits, would be more appropriate. However,
4 is a composite integer and the related expansion is not suitable, since the
corresponding $|\cdot|_4$ absolute value is not a norm but a pseudonorm, which
creates a problem with the uniqueness of the distance between two points. To
illustrate this problem, consider the distance between the numbers
4 and 0. We have $d_4(0, 4) = |4|_4 = \frac{1}{4}$, but on the other hand
$d_4(0, 4) = |2|_4 \, |2|_4 = 1$.
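
The ambiguity for the composite base 4 can be reproduced numerically. The sketch below uses a hypothetical helper `naive_norm`, which applies the recipe "$m^{-k}$ for the largest power $m^k$ dividing x" to any base m; it is meant only to illustrate the argument above.

```python
from fractions import Fraction


def naive_norm(x, m):
    """m**(-k), where m**k is the largest power of m dividing the nonzero integer x."""
    if x == 0:
        raise ValueError("x must be nonzero")
    k = 0
    while x % m == 0:
        x //= m
        k += 1
    return Fraction(1, m**k)


# Composite base 4: the value assigned to 4 = 2 * 2 depends on how it is computed.
print(naive_norm(4, 4), naive_norm(2, 4) * naive_norm(2, 4))
# Prime base 5: the two computations agree.
print(naive_norm(25, 5), naive_norm(5, 5) * naive_norm(5, 5))
```

Multiplicativity, property (ii) of Section 2, thus fails for base 4 but holds for the prime base 5.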

Thus for four nucleotides, which appear in the strict complementarity 
between the two DNA strands, i.e. make two base pairs (A, T) and (C, G), 
we choose the corresponding 5-adic integer numbers to construct the cor- 
responding DNA sequence model. Namely, we attach digits (1, 2, 3, 4) to 
nucleotides (C, A, T, G) in the following way: 

C = 1, A = 2, T = 3, G = 4. (7) 

Recall that there are p digits in the representation of a p-adic number. In
this approach, the digit 0 does not play a role in the representation of a
single helical chain or in RNA coding. It is worth noting that we
also considered other possible assignments of nucleotides
to four of the above five digits. However, we find that the choice (7) is the
most suitable and attractive.

In this way any of the DNA chains can be presented as a 5-adic number
in the form

$$x = 5^N (x_0 + x_1 5 + x_2 5^2 + \cdots + x_n 5^n) , \quad x_i \neq 0 , \quad N \in \mathbb{N} \cup \{0\} , \quad n \in \mathbb{N} , \qquad (8)$$

where $x_i = 1, 2, 3, 4$ and n is a sufficiently large natural number. This chain
can also be presented as

$$x = \sum_{j=1}^{\omega} 5^{N_j} (x_0 + x_1 5 + x_2 5^2 + \cdots + x_{n_j} 5^{n_j}) , \quad N_1 < N_2 < \cdots < N_\omega , \qquad (9)$$

where $\omega$ is the number of subsequences, those which encode and those which do not
encode proteins, in a chain of the DNA. One can introduce 5-adic distance 
between genes and it will be characterized by 5 4 . 

For a simple illustrative example (N = 0, n = 10), to the chain of
nucleotides

$$a = \mathrm{ATGCAAGTGA} \qquad (10)$$

there corresponds the 5-adic number

$$a = 2 + 3 \cdot 5 + 4 \cdot 5^2 + 1 \cdot 5^3 + 2 \cdot 5^4 + 2 \cdot 5^5 + 4 \cdot 5^6 + 3 \cdot 5^7 + 4 \cdot 5^8 + 2 \cdot 5^9 , \qquad (11)$$

which can also be written using only its digits:

$$a = 2341224342 . \qquad (12)$$
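
The passage from a nucleotide chain to its digit string and to the corresponding integer can be sketched in Python as follows. The dictionary `DIGIT` encodes the assignment (7); the function names are ours, and giving U the same digit as T is the convention adopted for RNA in Section 5.

```python
# Digit assignment (7); U shares the digit of T (the RNA convention of Section 5).
DIGIT = {"C": 1, "A": 2, "T": 3, "U": 3, "G": 4}


def chain_digits(chain):
    """Digits x_0, x_1, ... of a nucleotide chain, read from left to right."""
    return [DIGIT[n] for n in chain]


def chain_to_int(chain, N=0):
    """Ordinary integer 5**N * (x_0 + x_1*5 + x_2*5**2 + ...), as in (8)."""
    return 5**N * sum(d * 5**k for k, d in enumerate(chain_digits(chain)))


chain = "ATGCAAGTGA"
print("".join(map(str, chain_digits(chain))))   # digit string, compare with (12)
print(chain_to_int(chain))                      # the same chain as an ordinary integer
```

Note that the leftmost nucleotide carries the lowest power of 5, so the digit string (12) is read in the opposite order to an ordinary base-5 numeral.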

In this approach a DNA sequence can be presented as a sum of two
5-adic integers. Let us denote DNA sequences by Greek letters $\alpha, \beta, \cdots$ and
their chain components by Latin ones $a, b, \cdots$. Then $\alpha = a + b$. In fact a
and b are firmly correlated because of complementarity, i.e. $b = \bar{a}$, where $\bar{a}$
is obtained from a by replacing the digits (1, 2, 3, 4) by (4, 3, 2, 1), respectively. The
$\alpha$ corresponding to (10) is

$$\alpha = a + \bar{a} = 2341224342 + 3214331213 = 01111111111 , \qquad (13)$$

where we performed the summation of digits from left to right, taking
$1 + 4 = 0 + 1 \cdot 5$ and $2 + 3 = 0 + 1 \cdot 5$. In this way the sum (13), which
corresponds to an example of DNA, takes a very simple form: it
is a definite sequence of the digit 1, of the same length as the DNA
and shifted one place to the right.

One can easily check that the integers $a$, $\bar{a}$ and $\alpha$ in (13) form the vertices of an
equilateral triangle, all three sides of which have the same 5-adic length, equal
to 1.
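
The sum (13) and the equilateral triangle can be verified with a short computation. The following Python sketch is illustrative; the helper names `to_int`, `to_digits` and `norm5` are our own.

```python
from fractions import Fraction

DIGIT = {"C": 1, "A": 2, "T": 3, "G": 4}   # assignment (7)


def to_int(ds):
    """Integer x_0 + x_1*5 + x_2*5**2 + ... from a list of digits."""
    return sum(d * 5**k for k, d in enumerate(ds))


def to_digits(m, length):
    """First `length` base-5 digits of a nonnegative integer, lowest power first."""
    out = []
    for _ in range(length):
        m, r = divmod(m, 5)
        out.append(r)
    return out


def norm5(x):
    """5-adic norm of an integer, as an exact Fraction."""
    if x == 0:
        return Fraction(0)
    k = 0
    while x % 5 == 0:
        x //= 5
        k += 1
    return Fraction(1, 5**k)


a_digits = [DIGIT[n] for n in "ATGCAAGTGA"]
abar_digits = [5 - d for d in a_digits]          # swaps 1 <-> 4 and 2 <-> 3
a, abar = to_int(a_digits), to_int(abar_digits)
alpha = a + abar

print(to_digits(alpha, len(a_digits) + 1))       # digits of the sum (13)
print(norm5(a - abar), norm5(alpha - a), norm5(alpha - abar))  # the three sides
```

Because complementary digits always add up to 5, every position produces a carry, which is why the sum consists of a leading 0 followed only by the digit 1.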

It is worth mentioning that the human genome, which contains all the genetic
information of the organism, is composed of more than three billion base
pairs and contains more than 30,000 genes.

5 p-Adic genetic code

A living cell is a very complex system composed mainly of protein
macromolecules playing various roles. All those proteins are made of only 20
amino-acids, which are the same for the whole living world on Earth.
Different sequences of amino-acids form different proteins. An intensive study
of the connection between the ordering of nucleotides in DNA (and RNA) and
the ordering of amino-acids in proteins led to the discovery of the genetic code.

At the end of the 1950s and the beginning of the 1960s many
basic properties of the genetic code were obtained. The genetic code is understood
as a dictionary for the translation of information from DNA (through RNA)
to the production of proteins from amino-acids. The information is contained in
codons, which are ordered sequences of three nucleotides. There are three
stop codons, and 61 codons are related to 20 amino-acids. Amino-acids in
proteins correspond to codons with various multiplicities (one, two, three, four
and six), i.e. the genetic code is degenerate. This is a well
established experimental fact.

However, there is no simple theoretical understanding of genetic coding.
In particular, it is not clear why the genetic code has just the known form
and not one of the many other possible forms. What principle (or principles)
is used in fixing the mitochondrial and eukaryotic codes? What properties of
codons are responsible for their appearance in quadruplets, sextets, doublets,
and even in a triplet and a singlet? These are only some of the many questions
which can be asked about the genetic code. Recall that the ribosome performs
the synthesis of proteins and that it somehow knows very reliably which amino-acid
corresponds to a given codon. In fact, the ribosome is a molecular machine
which performs multiple functions, and one of them should be the computation
of codon properties.

Let us now consider possible answers to the above questions on the genetic
code, starting from the 5-adic model. According to our approach, a codon in
RNA is an integer of the following form:

$$c = c_0 + c_1 5 + c_2 5^2 , \quad c_0 , c_1 , c_2 \in \{1, 2, 3, 4\} , \qquad (14)$$

where, without loss of generality, we take N = 0. In RNA the nucleotide
T is replaced by U; we retain the same digit (T = 3) and take U = 3. In
this way the digit 0 is not used in the presentation of codons.

Having the above choice of digits (i.e. C=1, A=2, U=3, G=4), we can
now look at Tables 1 and 2 and observe the corresponding ultrametric
(5-adic and 2-adic) reason for the formation of quadruplets and doublets.
Codons are denoted simultaneously by three digits and by capital letters. The
corresponding amino-acids are presented in the usual three-letter form.

111 CCC Pro    121 CAC His    131 CUC Leu    141 CGC Arg
112 CCA Pro    122 CAA Gln    132 CUA Leu    142 CGA Arg
113 CCU Pro    123 CAU His    133 CUU Leu    143 CGU Arg
114 CCG Pro    124 CAG Gln    134 CUG Leu    144 CGG Arg

211 ACC Thr    221 AAC Asn    231 AUC Ile    241 AGC Ser
212 ACA Thr    222 AAA Lys    232 AUA Met    242 AGA Ter
213 ACU Thr    223 AAU Asn    233 AUU Ile    243 AGU Ser
214 ACG Thr    224 AAG Lys    234 AUG Met    244 AGG Ter

311 UCC Ser    321 UAC Tyr    331 UUC Phe    341 UGC Cys
312 UCA Ser    322 UAA Ter    332 UUA Leu    342 UGA Trp
313 UCU Ser    323 UAU Tyr    333 UUU Phe    343 UGU Cys
314 UCG Ser    324 UAG Ter    334 UUG Leu    344 UGG Trp

411 GCC Ala    421 GAC Asp    431 GUC Val    441 GGC Gly
412 GCA Ala    422 GAA Glu    432 GUA Val    442 GGA Gly
413 GCU Ala    423 GAU Asp    433 GUU Val    443 GGU Gly
414 GCG Ala    424 GAG Glu    434 GUG Val    444 GGG Gly

Table 1: The vertebrate mitochondrial code

111 CCC Pro    121 CAC His    131 CUC Leu    141 CGC Arg
112 CCA Pro    122 CAA Gln    132 CUA Leu    142 CGA Arg
113 CCU Pro    123 CAU His    133 CUU Leu    143 CGU Arg
114 CCG Pro    124 CAG Gln    134 CUG Leu    144 CGG Arg

211 ACC Thr    221 AAC Asn    231 AUC Ile    241 AGC Ser
212 ACA Thr    222 AAA Lys    232 AUA Ile    242 AGA Arg
213 ACU Thr    223 AAU Asn    233 AUU Ile    243 AGU Ser
214 ACG Thr    224 AAG Lys    234 AUG Met    244 AGG Arg

311 UCC Ser    321 UAC Tyr    331 UUC Phe    341 UGC Cys
312 UCA Ser    322 UAA Ter    332 UUA Leu    342 UGA Ter
313 UCU Ser    323 UAU Tyr    333 UUU Phe    343 UGU Cys
314 UCG Ser    324 UAG Ter    334 UUG Leu    344 UGG Trp

411 GCC Ala    421 GAC Asp    431 GUC Val    441 GGC Gly
412 GCA Ala    422 GAA Glu    432 GUA Val    442 GGA Gly
413 GCU Ala    423 GAU Asp    433 GUU Val    443 GGU Gly
414 GCG Ala    424 GAG Glu    434 GUG Val    444 GGG Gly

Table 2: The eukaryotic code



Our observations are as follows.

(i) Codons with the same first two digits are at the same 5-adic distance from
each other, equal to $\frac{1}{25}$. This property leads to the clustering of the 64 codons into 16
quadruplets. Namely, any two codons a and b whose first two digits are
mutually equal and whose third digits differ have 5-adic distance

$$d_5(a, b) = |a_0 + a_1 5 + a_2 5^2 - (a_0 + a_1 5 + b_2 5^2)|_5 = |(a_2 - b_2) 5^2|_5 = 5^{-2} , \qquad (15)$$

where $a_0, a_1, a_2, b_2 \in \{1, 2, 3, 4\}$ and $a_2 \neq b_2$. Since $a_0$ and $a_1$ may each have
four values, there are 16 quadruplets.

(ii) With respect to the 2-adic distance, the above clusters may be regarded as
composed of two doublets: $a = a_0 a_1 1$ and $b = a_0 a_1 3$ make the first doublet,
and $c = a_0 a_1 2$ and $d = a_0 a_1 4$ form the second one. The 2-adic distance between
codons within each of these doublets is $\frac{1}{2}$, i.e.

$$d_2(a, b) = |(3 - 1) 5^2|_2 = \frac{1}{2} , \quad d_2(c, d) = |(4 - 2) 5^2|_2 = \frac{1}{2} . \qquad (16)$$


(iii) Quadruplets which have digit 1 at the second position do not decay
into two doublets. Each of these four quadruplets corresponds to one of
four different amino-acids.

(iv) Quadruplets which have digit 2 at the second position decay into the two
doublets mentioned in (ii). Each of these eight doublets corresponds to
one of eight new, different assignments (seven amino-acids and the termination
signal Ter, see Tables 1 and 2).

(v) The doublet structure of quadruplets which have digit 3 or 4 at the second
position is more complex and depends also on the digit at the
first place. Quadruplets with digits 13i, 43i, 14i and 44i, where
$i \in \{1, 2, 3, 4\}$, are stable and have no substructure. However, for the other
four combinations of the first two digits the situation depends on the kind
of coding (mitochondrial or eukaryotic). The situation is simple for the
vertebrate mitochondrial code: quadruplets with digits 23i, 33i, 24i and
34i, where $i \in \{1, 2, 3, 4\}$, are not stable and decay into doublets. In the
case of the eukaryotic (universal) code, the quadruplet with digits 23i
decays into one Ile-triplet (231, 232, 233) and one Met-singlet 234, while
the quadruplet 34i separates into one doublet and two different singlets.

We would like to emphasize that codons ending in the digits 1 and 3, which
are at 2-adic distance $\frac{1}{2}$ from each other, always appear together and determine the same
amino-acid.
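
Observations (i), (ii) and the final remark can be checked mechanically against the eukaryotic code of Table 2. The Python sketch below is an illustration under stated assumptions: the compact one-letter table string `AMINO` (in the conventional U, C, A, G ordering, with `*` for Ter) is our own encoding of the standard code, and the helper names `value` and `dist` are hypothetical.

```python
from fractions import Fraction
from itertools import product

DIGIT = {"C": 1, "A": 2, "U": 3, "G": 4}

# Standard (eukaryotic) code in the conventional U, C, A, G ordering;
# '*' marks a stop codon (Ter). This compact encoding is an assumption of the sketch.
BASES = "UCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {"".join(c): AMINO[i] for i, c in enumerate(product(BASES, repeat=3))}


def value(codon):
    """Integer c_0 + c_1*5 + c_2*5**2 of a codon, as in (14)."""
    return sum(DIGIT[n] * 5**k for k, n in enumerate(codon))


def dist(a, b, p):
    """p-adic distance between the integers attached to two codons."""
    diff = abs(value(a) - value(b))
    if diff == 0:
        return Fraction(0)
    k = 0
    while diff % p == 0:
        diff //= p
        k += 1
    return Fraction(1, p**k)


codons = list(CODE)
prefixes = ["".join(q) for q in product(BASES, repeat=2)]

# (i): distinct codons are at 5-adic distance 1/25 exactly when they share
# their first two nucleotides, which yields the 16 quadruplets.
quadruplets_ok = all(
    (dist(a, b, 5) == Fraction(1, 25)) == (a[:2] == b[:2])
    for a, b in product(codons, repeat=2) if a != b)

# (ii): inside each quadruplet the endings C/U (digits 1, 3) and A/G (digits 2, 4)
# are at 2-adic distance 1/2.
doublets_ok = all(
    dist(q + "C", q + "U", 2) == dist(q + "A", q + "G", 2) == Fraction(1, 2)
    for q in prefixes)

# Doublets ending in digits 1 and 3 always agree; those ending in 2 and 4 may split.
same_13 = all(CODE[q + "C"] == CODE[q + "U"] for q in prefixes)
split_24 = [q for q in prefixes if CODE[q + "A"] != CODE[q + "G"]]

print(quadruplets_ok, doublets_ok, same_13)
print(split_24)   # prefixes whose A/G doublet codes two different things
```

In the eukaryotic code the only doublets ending in digits 2 and 4 that split are those with prefixes AU (23i) and UG (34i), in agreement with observation (v).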

6 Concluding remarks 

In this paper we proposed a new and simple model to investigate information
aspects of DNA, RNA and the genetic code. To this end, we introduced the
corresponding p-adic information space and connected it with DNA when
p = 5.

An essential property of any p-adic space is ultrametric behavior of dis- 
tances between its elements, which radically differs from usual distances on 
a space of real numbers. It is significant that we attached just 5-adic integer 
numbers to the sequence of codons and not real integers in base 5. 

Classification of any set of objects is an ordering of them into groups
according to some of their relations. Using the 5-adic and 2-adic distances between codons
we obtained their classification into quadruplets and doublets, respectively.
As a result of the above analysis one obtains the following principle of genetic
coding: p-adically close codons correspond to the same amino-acid.

We plan to continue research on this model and to develop its formalism
as well as to apply it to more concrete cases.